INN Hotels Group has a chain of hotels in Portugal, they are facing
problems with the high number of booking cancellations. In the hotel
industry, effective management of bookings is crucial for optimizing
revenue and ensuring operational efficiency.
Our main objective in this project is to analyze the dataset to find
which factors have a high influence on booking cancellations while
answering some SMART questions. By examining customer attributes such as
room preferences, lead time, meal plan choices, and past booking
behavior, we will identify key patterns and trends. Additionally, we
will also analyze other variables such as special requests, arrival
month and market segments.
1. Do customers who make special requests cancel
their bookings less frequently than those who don’t?
2. Are guests with no previous cancellations more
likely to avoid canceling their current booking?
3. What are the key factors that show the strongest
correlation with booking cancellations?
4. How do hotel cancellation rates change across
different seasons (spring, fall, winter, summer, holidays, low season),
and which factors (e.g., lead time, room type/) correlate most strongly
with cancellations during each season?
5. What are the differences in cancellation rates based
on meal plan type, lead time, room type, and market segment, and what
are the 95% confidence intervals for these rates?
For this project, we have selected a hotel booking dataset containing over 36,000 records of bookings. [INNHotelsGroup dataset](https://www.kaggle.com/datasets/mariyamalshatta/inn-hotels-group) from Kaggle. This dataset offers a comprehensive overview of hotel booking patterns, making it ideal for our analysis on cancellations and no-shows.
hotel_data <- read.csv("../Dataset/INNHotelsGroup.csv")
This dataset contains 36275 observations of 19 variables. Out of these observations, 0 rows contain null values.
str(hotel_data)
## 'data.frame': 36275 obs. of 19 variables:
## $ Booking_ID : chr "INN00001" "INN00002" "INN00003" "INN00004" ...
## $ no_of_adults : int 2 2 1 2 2 2 2 2 3 2 ...
## $ no_of_children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_weekend_nights : int 1 2 2 0 1 0 1 1 0 0 ...
## $ no_of_week_nights : int 2 3 1 2 1 2 3 3 4 5 ...
## $ type_of_meal_plan : chr "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
## $ required_car_parking_space : int 0 0 0 0 0 0 0 0 0 0 ...
## $ room_type_reserved : chr "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
## $ lead_time : int 224 5 1 211 48 346 34 83 121 44 ...
## $ arrival_year : int 2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
## $ arrival_month : int 10 11 2 5 4 9 10 12 7 10 ...
## $ arrival_date : int 2 6 28 20 11 13 15 26 6 18 ...
## $ market_segment_type : chr "Offline" "Online" "Online" "Online" ...
## $ repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_price_per_room : num 65 106.7 60 100 94.5 ...
## $ no_of_special_requests : int 0 1 0 0 0 1 1 1 1 3 ...
## $ booking_status : chr "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...
Data Dictionary
For our project, we are interested in majority of the variables but there are some irrelevant columns with respect to our objective and thus we will drop them.
hotel_data_clean <- hotel_data[, c("no_of_adults", "no_of_children", "no_of_weekend_nights", "no_of_week_nights", "type_of_meal_plan", "required_car_parking_space", "room_type_reserved", "lead_time", "arrival_year", "arrival_month", "market_segment_type", "repeated_guest", "no_of_previous_cancellations", "no_of_special_requests", "booking_status",
"avg_price_per_room")]
hotel_data_clean <- na.omit(hotel_data_clean)
After cleaning, we are left with 36275 observations of 16 variables.
write.csv(hotel_data_clean,"../Dataset/INNHotelsGroup_min.csv", row.names = F)
hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$room_type_reserved <- as.factor(hotel_data_clean$room_type_reserved)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)
hotel_data_clean$market_segment_type <- as.factor(hotel_data_clean$market_segment_type)
hotel_data_clean$arrival_year <- as.factor(hotel_data_clean$arrival_year)
hotel_data_clean$arrival_month <- as.factor(hotel_data_clean$arrival_month)
summary(hotel_data_clean)
## no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## Min. :0.00 Min. : 0.00 Min. :0.00 Min. : 0.0
## 1st Qu.:2.00 1st Qu.: 0.00 1st Qu.:0.00 1st Qu.: 1.0
## Median :2.00 Median : 0.00 Median :1.00 Median : 2.0
## Mean :1.84 Mean : 0.11 Mean :0.81 Mean : 2.2
## 3rd Qu.:2.00 3rd Qu.: 0.00 3rd Qu.:2.00 3rd Qu.: 3.0
## Max. :4.00 Max. :10.00 Max. :7.00 Max. :17.0
##
## type_of_meal_plan required_car_parking_space room_type_reserved
## Meal Plan 1 :27835 Min. :0.000 Room_Type 1:28130
## Meal Plan 2 : 3305 1st Qu.:0.000 Room_Type 2: 692
## Meal Plan 3 : 5 Median :0.000 Room_Type 3: 7
## Not Selected: 5130 Mean :0.031 Room_Type 4: 6057
## 3rd Qu.:0.000 Room_Type 5: 265
## Max. :1.000 Room_Type 6: 966
## Room_Type 7: 158
## lead_time arrival_year arrival_month market_segment_type
## Min. : 0 2017: 6514 10 : 5317 Aviation : 125
## 1st Qu.: 17 2018:29761 9 : 4611 Complementary: 391
## Median : 57 8 : 3813 Corporate : 2017
## Mean : 85 6 : 3203 Offline :10528
## 3rd Qu.:126 12 : 3021 Online :23214
## Max. :443 11 : 2980
## (Other):13330
## repeated_guest no_of_previous_cancellations no_of_special_requests
## Min. :0.000 Min. : 0.00 Min. :0.00
## 1st Qu.:0.000 1st Qu.: 0.00 1st Qu.:0.00
## Median :0.000 Median : 0.00 Median :0.00
## Mean :0.026 Mean : 0.02 Mean :0.62
## 3rd Qu.:0.000 3rd Qu.: 0.00 3rd Qu.:1.00
## Max. :1.000 Max. :13.00 Max. :5.00
##
## booking_status avg_price_per_room
## Canceled :11885 Min. : 0
## Not_Canceled:24390 1st Qu.: 80
## Median : 99
## Mean :103
## 3rd Qu.:120
## Max. :540
##
Summary:
The average price per room in the dataset is 103.4 euros, with a median
of 99 euros, but prices can reach up to 540 euros, indicating
high-priced outliers. Some entries even show an average price of zero,
possibly reflecting promotional deals. Guests typically stay for two
weekday nights and one weekend night, with the average number of weekday
nights being 2.2 and weekend nights 0.81. Stays can extend to as many as
17 weekday nights. Most bookings involve two adults, and many guests do
not bring children. Lead times vary significantly, with an average of 85
days, a median of 57, and some bookings made up to 443 days in advance,
suggesting a right-skewed distribution. The data also shows sparse
previous cancellations, with an average of just 0.02, and a maximum of
58. Bookings are spread over 2017 and 2018, peaking in August.
Additionally, most guests do not make special requests, as the median is
zero, although some make up to five requests per booking.
print(ggplot(hotel_data_clean, aes(x = booking_status)) +
geom_bar(fill = "lightblue") +
labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))
Summary
The dataset contains 36,275 bookings, with 24,390 classified as “Not
Canceled” and 11,885 as “Canceled”. Approximately two-thirds of the
bookings were completed, while the remaining one-third were
canceled.
print(ggplot(hotel_data_clean, aes(x = type_of_meal_plan, fill = booking_status)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rate by Meal Plan", x = "Meal Plan", y = "Proportion", fill = "Booking Status"))
Summary Meal Plan 1 has the highest cancellation rate, where more than half of the bookings are canceled. Meal Plan 2 has a lower cancellation rate, with cancellations and non-cancellations being nearly equal. Meal Plan 3 shows the highest proportion of non-cancelled bookings compared to canceled ones, suggesting customers choosing this meal plan tend to cancel less frequently. For the “Not Selected” category, there is a relatively high cancellation rate, similar to Meal Plan 1.
print(ggplot(hotel_data_clean, aes(x = room_type_reserved, fill = booking_status)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rate by Room Type", x = "Room Type", y = "Proportion", fill = "Booking Status"))
Summary
Room Type 1 and Room Type 6 have the highest cancellation rates, with a
large proportion of canceled bookings. Room Type 7 stands out as having
the highest proportion of non-cancelled bookings, suggesting it may be a
preferred or better-secured type of room. For other room types, the
rates are fairly similar, with around 60-70% of bookings not canceled,
except for Room Type 1 which has more cancellations.
print(ggplot(hotel_data_clean, aes(x = factor(no_of_special_requests))) +
geom_bar(fill = "orange") +
labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
theme_minimal())
Summary Most guests make no special requests, with a sharp decline as the number increases. A significant portion makes one request, while two or more requests are increasingly rare.
contingency_table <- table(hotel_data_clean$market_segment_type, hotel_data_clean$booking_status)
plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")
print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Booking Status by Market Segment",
x = "Market Segment",
y = "Count",
fill = "Booking Status") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
ggplot(hotel_data_clean, aes(x = lead_time)) +
geom_histogram(binwidth = 30, fill = "blue", color = "black") +
theme_minimal() +
ggtitle("Distribution of Lead Time")
ggplot(hotel_data_clean, aes(y = avg_price_per_room)) +
geom_boxplot(fill = "orange") +
theme_minimal() +
ggtitle("Boxplot of Average Price per Room")
Null Hypothesis (H₀): There is no significant
association between special requests and booking status.
Alternative Hypothesis (H₁:)There is significant
association between special requests and booking status.
hotel_data_clean <- hotel_data_clean %>%
mutate(has_special_requests = ifelse(no_of_special_requests > 0, "Yes", "No"),
is_cancelled = ifelse(booking_status == "Canceled", "Canceled", "Not_Canceled"))
special_requests_table <- table(hotel_data_clean$has_special_requests, hotel_data_clean$is_cancelled)
sum(special_requests_table) > 0
## [1] TRUE
special_requests_test <- chisq.test(special_requests_table)
special_requests_props <- prop.table(special_requests_table, margin = 1)
Contingency Table:
print(special_requests_table)
##
## Canceled Not_Canceled
## No 8545 11232
## Yes 3340 13158
Chi-Square Test Results:
print(special_requests_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: special_requests_table
## X-squared = 2152, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_special_requests, fill = is_cancelled)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rates by Special Requests",
x = "Has Special Requests", y = "Proportion") +
theme_minimal())
The chi-square test result shows a very small p-value which is much less than the typical significance level of 0.05. Due to this we reject the null hypothesis (H₀) and accept the alternative hypothesis (H₁).
Bookings with special requests have a much lower cancellation rate (20.2%) compared to those without (43.2%). The data strongly suggests that customers who make special requests are more likely to follow through with their bookings and significantly less likely to cancel their bookings.
Null Hypothesis (H₀): There is no association
between having previous cancellations and the likelihood of canceling
the current booking.
Alternative Hypothesis (H₁): Guests with no previous
cancellations are less likely to cancel their current booking.
hotel_data_clean <- hotel_data_clean %>%
mutate(has_previous_cancellations = ifelse(no_of_previous_cancellations > 0, "Yes", "No"))
previous_cancellations_table <- table(hotel_data_clean$has_previous_cancellations, hotel_data_clean$is_cancelled)
previous_cancellations_test <- chisq.test(previous_cancellations_table)
previous_cancellations_props <- prop.table(previous_cancellations_table, margin = 1)
Contingency Table:
print(previous_cancellations_table)
##
## Canceled Not_Canceled
## No 11869 24068
## Yes 16 322
Chi-Square Test Results:
print(previous_cancellations_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: previous_cancellations_table
## X-squared = 120, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_previous_cancellations, fill = is_cancelled)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rates by Previous Cancellations",
x = "Has Previous Cancellations", y = "Proportion") +
theme_minimal())
Due to the p-value being below the significance threshold of 0.05, we reject the null hypothesis. This provides strong statistical evidence to support the alternative hypothesis that guests with previous cancellations are less likely to cancel their current booking.
Surprisingly, bookings from guests with previous cancellations have a much lower cancellation rate (4.73%) compared to those without previous cancellations (33.03%). This represents a substantial difference of about 28.3 percentage points.
# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
# Remove columns with zero variance
data_filtered <- remove_zero_variance(select_if(hotel_data_clean, is.numeric))
# Calculate the correlation matrix
cor_data <- cor(data_filtered, use = "complete.obs")
# Visualize the correlation matrix
corrplot(cor_data, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Entire Dataset", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
The correlation analysis reveals key insights into factors linked to booking cancellations. Lead time has the strongest positive correlation (0.44), indicating that bookings made further in advance are more likely to be canceled. The number of special requests shows a negative correlation (-0.25), suggesting that personalized arrangements reduce cancellation likelihood. Average price per room has a weak positive correlation (0.14), implying that higher-priced bookings may be slightly more prone to cancellations. Repeated guests are slightly less likely to cancel (-0.11). Factors like the number of adults, children, and length of stay show minimal impact.
Seasonal correlation analysis was conducted to explore how booking behavior varies across spring, summer, fall, and winter. Understanding seasonal patterns in lead time, pricing, and cancellation rates can help customize strategies to reduce cancellations throughout the year. Overall, lead time and customer commitment stand out as key determinants of cancellations.
# Define seasons based on arrival month
hotel_data_clean$season <- case_when(
hotel_data_clean$arrival_month %in% c(3, 4, 5) ~ "Spring",
hotel_data_clean$arrival_month %in% c(6, 7, 8) ~ "Summer",
hotel_data_clean$arrival_month %in% c(9, 10, 11) ~ "Fall",
hotel_data_clean$arrival_month %in% c(12, 1, 2) ~ "Winter",
TRUE ~ "Unknown"
)
# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Create subsets for each season
spring_data <- subset(hotel_data_clean, season == "Spring")
summer_data <- subset(hotel_data_clean, season == "Summer")
fall_data <- subset(hotel_data_clean, season == "Fall")
winter_data <- subset(hotel_data_clean, season == "Winter")
# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
# Remove zero variance columns for each season
spring_data_filtered <- remove_zero_variance(select_if(spring_data, is.numeric))
summer_data_filtered <- remove_zero_variance(select_if(summer_data, is.numeric))
fall_data_filtered <- remove_zero_variance(select_if(fall_data, is.numeric))
winter_data_filtered <- remove_zero_variance(select_if(winter_data, is.numeric))
# Calculate and visualize correlation for Spring
cor_spring <- cor(spring_data_filtered, use = "complete.obs")
corrplot(cor_spring, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Spring", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Summer
cor_summer <- cor(summer_data_filtered, use = "complete.obs")
corrplot(cor_summer, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Summer", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Fall
cor_fall <- cor(fall_data_filtered, use = "complete.obs")
corrplot(cor_fall, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Fall", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Winter
cor_winter <- cor(winter_data_filtered, use = "complete.obs")
corrplot(cor_winter, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Winter", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
Fall Correlation Matrix:
The strongest positive correlation with booking status (0.54) is lead
time, suggesting that as the lead time increases, there is a higher
chance of the booking being canceled. This makes sense, as bookings made
far in advance may have a higher likelihood of being reconsidered or
cancelled.
Number of special requests has a slight negative correlation with
booking status (-0.23), indicating that bookings with more special
requests tend to have a lower likelihood of cancellation.
Summer Correlation Matrix:
Lead time continues to show a strong positive correlation (0.43) with
booking status, compared to other variables, meaning that longer lead
times are associated with cancellations during the summer as well.
Number of special requests shows a moderate negative correlation
(-0.30), reinforcing the idea that bookings with special requests are
less likely to be cancelled.
Average price per room has a weak positive correlation (0.22),
suggesting that higher-priced rooms might be slightly more likely to be
cancelled in the summer.
Spring Correlation Matrix:
The positive correlation with lead time remains somewhat significant at
0.29, consistent with the previous seasons. Longer lead times are
associated with a higher likelihood of cancellation.
Special requests have the strongest negative correlation (-0.36) in the
spring, indicating that bookings with more special requests are much
less likely to be canceled during this season.
Average price per room shows a small positive correlation (0.13) with
booking status, indicating a slight tendency for higher-priced bookings
to be cancelled.
Winter Correlation Matrix:
Lead time remains a positive trending factor, with a correlation of
0.24. Longer lead times during the winter continue to be associated with
higher cancellation rates.
The number of special requests has a negative correlation (-0.14),
suggesting that special requests still reduce the likelihood of
cancellations in the winter, although the effect is less pronounced
compared to other seasons.
Average price per room has a slightly stronger positive correlation
(0.36), indicating that higher-priced rooms may have a higher chance of
being cancelled during the winter season.
Null Hypothesis (H₀): There is no significant difference in
cancellation rates based on the type of meal plan.
Alternative Hypothesis (H₁): There is a significant difference
in cancellation rates based on the type of meal plan.
hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)
contingency_table <- table(hotel_data_clean$type_of_meal_plan, hotel_data_clean$booking_status)
chi_test_result <- chisq.test(contingency_table)
print(chi_test_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 278, df = 3, p-value <2e-16
Null Hypothesis (H₀): There is no association between booking
status and type of meal plan.
Alternative Hypothesis (H₁): There is an association between
booking status and type of meal plan.
hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)
# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)
ttestlead_time <- t.test(df_canceled$lead_time, df_not_canceled$lead_time,
alternative = "two.sided", conf.level = 0.95)
ttestlead_time
##
## Welch Two Sample t-test
##
## data: df_canceled$lead_time and df_not_canceled$lead_time
## t = 81, df = 16886, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 78.3 82.2
## sample estimates:
## mean of x mean of y
## 139.2 58.9
Null Hypothesis (H₀): There is no significant difference in
cancellation rates based on the Average Room Price.
Alternative Hypothesis (H₁): There is a significant difference
in cancellation rates based on the Average Room Price.
# Creating a binary variable for cancellation
hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)
# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)
# Performing a t-test to compare average price per room between those who canceled and those who didn't
ttest_avg_price_per_room <- t.test(
df_canceled$avg_price_per_room,
df_not_canceled$avg_price_per_room,
alternative = "two.sided",
conf.level = 0.95
)
# Function to interpret the t-test results
interpret_ttestavg_price_per_room <- function(ttest_result) {
p_value <- ttest_result$p.value # Extract the p-value from the t-test result
# Print the test summary
print(ttest_result)
# Interpret based on the p-value threshold (commonly set at 0.05)
if (p_value < 0.05) {
cat("Conclusion:\n")
cat("There is a statistically significant difference in the average price per room between customers who canceled and those who did not.\n")
cat("The p-value is less than 0.05, meaning the observed difference in means is unlikely to have occurred by chance.\n")
} else {
cat("Conclusion:\n")
cat("There is no statistically significant difference in the average price per room between customers who canceled and those who did not.\n")
cat("The p-value is greater than 0.05, so we fail to reject the null hypothesis that the difference in means is zero.\n")
}
}
# Call the function with the t-test result
interpret_ttestavg_price_per_room(ttest_avg_price_per_room)
##
## Welch Two Sample t-test
##
## data: df_canceled$avg_price_per_room and df_not_canceled$avg_price_per_room
## t = 28, df = 25929, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 9.92 11.39
## sample estimates:
## mean of x mean of y
## 110.6 99.9
##
## Conclusion:
## There is a statistically significant difference in the average price per room between customers who canceled and those who did not.
## The p-value is less than 0.05, meaning the observed difference in means is unlikely to have occurred by chance.
2.Cancellation rate for bookings without special requests: 8545 / (8545 + 11232) ≈ 43.2%. Cancellation rate for bookings with special requests: 3340 / (3340 + 13158) ≈ 20.2%. There’s a clear difference in cancellation rates, with bookings having special requests showing a significantly lower cancellation rate.